Abstract

Graph  partitioning  is  a  topic  of  extensive  interest,  with  applications  to  parallel  processing.  In 
this  context  graph  nodes  typically  represent  computation,  and  edges  represent  communication. 
One seeks to distribute the workload by partitioning the graph so that every processor has
approximately the same workload, and the communication cost (measured as a function of
edges exposed by the partition) is minimized. Measures of partition quality vary; in this paper
we consider a processor's cost to be the sum of its computation and communication costs, and
consider the cost of a partition to be the bottleneck, or maximal processor cost induced by the
partition. For a general graph the problem of finding an optimal partitioning is intractable.
In this paper we restrict our attention to the class of k-ary n-cube graphs with uniformly
weighted  nodes.  Given  mild  restrictions  on  the  node  weight  and  number  of  processors,  we 
identify  partitions  yielding  the  smallest  bottleneck.  We  also  demonstrate  by  example  that  some 
restrictions  are  necessary  for  the  partitions  we  identify  to  be  optimal.  In  particular,  there  exist 
cases  where  partitions  that  evenly  partition  nodes  need  not  be  optimal. 
contract number NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science
and Engineering (ICASE), NASA Langley Research Center, Hampton, VA, 23681.
1 Introduction


The  problem  of  assigning  workload  in  a  parallel  system  has  long  been  viewed  as  important,  and 
in  the  general  case,  as  intractable.  A  significant  amount  of  research  has  addressed  the  problem  of 
finding  good,  if  not  optimal,  workload  mappings;  a  number  of  different  objective  functions  have 
been used. All relevant objective functions recognize that both load balance and communication
cost are important. While workload imbalance is generally defined as a large deviation
between the maximum and average load among processors, treatments of communication costs
differ. A common technique is to measure the communication cost as the sum of all communication
induced by the mapping. While this sometimes leads to more tractable treatments (e.g., [8, 12]), it
does not capture the fact that communication can happen in parallel. An alternative formulation is
to assess the sum of computation and communication for each processor, and measure the quality
of the mapping as the maximum processor load, or bottleneck [5, 13]. The bottleneck measure does
not  take  precedence  relationships  into  consideration,  and  so  is  most  useful  in  highly  data-parallel 
computations  where  processors  typically  cycle  through  computation  and  communication  phases. 

In this paper we assume that a very regular graph, a k-ary n-cube [6], describes the computation
and communication needs of a data-parallel problem. Each node in the graph represents some
piece of computational work, which we assume takes w time to perform. Each edge (i, j) represents
some implicit communication necessary between nodes i and j; typically such an edge reflects a
data dependency of node i's computation for the present iteration on the result of executing node
j in the previous iteration (and vice-versa). The edges may be viewed as communication that
must occur at the end of an iteration. We desire to partition the graph into p node sets, assigned
one per processor, so as to minimize the bottleneck cost. The problem is not entirely academic.
Several current parallel architectures have communication topologies based on the k-ary n-cube.
The problem of partitioning a communication topology arises, for instance, when one executes a
parallel simulation of traffic on a k-ary n-cube network [7, 1].

The  objective  of  this  paper  is  to  show  that  under  mild  restrictions  on  w  and  p,  the  optimal 
partition  is  intuitive,  one  that  equi-partitions  the  graph  into  node  sets  that  are  internally  clustered 
as  tightly  as  possible.  The  main  requirement  turns  out  to  be  that  p  be  large  enough  relative  to  the 
size of the k-ary n-cube. The central point of interest is that restrictions on w and p are needed;
while intuitive, our results are not at all immediate. We also point out that previous analyses of
partitioning regular grids differ from the current work in a subtle but important way. It is not
the objective of the paper to give new partitioning algorithms, but to clarify one's intuition about
partitioning k-ary n-cubes.

There  are  three  bodies  of  work  on  graph  partitioning  that  bear  discussion.  The  technique  of 
recursive spectral dissection (e.g., [2]) divides a graph into two pieces, based on an eigenvalue
analysis of a matrix describing the graph connectivity. The algorithm is applied recursively until
p node sets are defined, where p is a power of two. Each partition cut is guaranteed to achieve a
certain level of load balance (not necessarily perfect balance), with a guaranteed upper bound on
the number of edges cut. Spectral dissection may find some of the partitions we identify as optimal
(when k is a power of two), but is not guaranteed to find them. Recursive geometric partitioning
(e.g., [9]) is similar in spirit, but different in details. A graph in $\mathbb{R}^n$ is projected onto
the unit sphere in $\mathbb{R}^{n+1}$, and the projection is stretched to locate the center of mass
(approximately) at the sphere's origin. A great circle cut of the sphere partitions the node set into
two pieces. The technique also guarantees a certain level of load balance and bounds the number of
edges cut. Like spectral partitioning, the method may find the optimal partitions (in the same
special case of k being a power of two), but also may not. On the other hand, recursive binary
dissection [3] (and its extension, parametric recursive binary dissection [4]) will find the
partitions we identify as optimal, when k is a power
of 2. In the case of general graphs there is no such guarantee. The heuristic described in [11] is
shown there to find optimal partitions of  and obvious extensions to the heuristic described in [10]
will find all optimal partitions identified in this paper, provided the correct number of processors
in each dimension is supplied in the problem description.


2  Problem  Formulation 


A k-ary n-cube $N_{k,n}$ is a graph with $k^n$ nodes, with an edge defined between two nodes $i$ and $j$
if, in the base-$k$ number system, the expressions of $i$ and $j$ differ in exactly one digit, and differ
there (modulo $k$) by exactly 1. Thus, if $i = b_{n-1} b_{n-2} \cdots b_0$ is the base-$k$ representation, then in
each dimension $j = 0, \ldots, n-1$, $i$ shares an edge with
$i' = b_{n-1} b_{n-2} \cdots ((b_j + 1) \bmod k)\, b_{j-1} \cdots b_0$ and with
$i'' = b_{n-1} b_{n-2} \cdots ((b_j - 1) \bmod k)\, b_{j-1} \cdots b_0$. These edges are said to be in dimension $j$, and
$i'$ and $i''$ are said to be dimension $j$ neighbors of $i$. Special cases include rings ($N_{k,1}$), hypercubes
($N_{2,n}$), and two- and three-dimensional toruses ($N_{k,2}$, $N_{k,3}$). It is useful to imagine $N_{k,n}$ as a collection of
interconnected rings resident in an n-dimensional space.
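
The neighbor rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: nodes are numbered 0 to k^n - 1, the digit of weight k^j is treated as dimension j, and the function name `neighbors` is hypothetical rather than taken from the paper.

```python
def neighbors(node, k, n):
    """Return the neighbors of `node` in the k-ary n-cube N_{k,n}.

    Assumption: `node` is an integer in [0, k**n) read as n base-k digits,
    and the digit of weight k**j is dimension j.
    """
    result = set()
    for j in range(n):
        weight = k ** j
        digit = (node // weight) % k
        for step in (1, -1):
            new_digit = (digit + step) % k  # wraparound on the dimension-j ring
            result.add(node + (new_digit - digit) * weight)
    result.discard(node)  # only matters for the degenerate case k = 1
    return result


if __name__ == "__main__":
    print(sorted(neighbors(0, 4, 2)))  # the four ring neighbors of node 0 in N_{4,2}
```

Using a set makes the two dimension-j neighbors collapse into one when k = 2, so a hypercube node gets n neighbors while a node with k > 2 gets 2n.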

A partition of $N_{k,n}$ into $p$ subdomains is a collection of nonempty node subsets
$\mathcal{P} = \{P_0, \ldots, P_{p-1}\}$. Abusing usual notation, we'll denote that an edge $e$ has at least one
endpoint in $P_i$ by $e \in P_i$, and define the indicator function $I(e, P_i)$ to be one if exactly one of
$e$'s endpoints is in $P_i$, and zero otherwise. Then we denote the number of external edges in $P_i$ by

$$Ext(P_i) = \sum_{e \in P_i} I(e, P_i),$$

denote the number of internal edges as

$$Int(P_i) = \sum_{e \in P_i} \bigl(1 - I(e, P_i)\bigr),$$

and define the cost of $P_i$ as

$$C(P_i) = w|P_i| + Ext(P_i).$$

Here we weight the cost of each node by $w$ to reflect the execution cost, where the communication
cost associated with one edge is unity. The cost of $\mathcal{P}$ is taken as

$$B(\mathcal{P}) = \max_{0 \le i < p} C(P_i).$$

Given $p$ and $w$, we wish to find the partition $\mathcal{P}$ that minimizes $B(\mathcal{P})$.
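
As a hedged illustration of these definitions, the sketch below evaluates Ext, Int, the cost C and the bottleneck B for an explicit partition. The integer node numbering and all function names are assumptions made for the example, not part of the paper.

```python
def neighbors(node, k, n):
    """Neighbors of `node` in N_{k,n}; nodes are integers read as n base-k digits."""
    result = set()
    for j in range(n):
        weight = k ** j
        digit = (node // weight) % k
        for step in (1, -1):
            result.add(node + ((digit + step) % k - digit) * weight)
    result.discard(node)
    return result


def ext_and_int(part, k, n):
    """Return (Ext(P_i), Int(P_i)) for the node set `part`."""
    part = set(part)
    ext = twice_int = 0
    for u in part:
        for v in neighbors(u, k, n):
            if v in part:
                twice_int += 1  # each internal edge is seen from both endpoints
            else:
                ext += 1        # exactly one endpoint lies in the part
    return ext, twice_int // 2


def cost(part, k, n, w):
    """C(P_i) = w |P_i| + Ext(P_i)."""
    return w * len(part) + ext_and_int(part, k, n)[0]


def bottleneck(partition, k, n, w):
    """B(P) = max over parts of C(P_i)."""
    return max(cost(part, k, n, w) for part in partition)


# Four 2 x 2 quadrants of the 4 x 4 torus N_{4,2}; node id = x + 4 * y.
quadrants = [{x + 4 * y for x in xs for y in ys}
             for xs in ((0, 1), (2, 3)) for ys in ((0, 1), (2, 3))]

if __name__ == "__main__":
    print(ext_and_int(quadrants[0], 4, 2), bottleneck(quadrants, 4, 2, w=1))
```

For each quadrant this gives 8 external and 4 internal edges, so with w = 1 the bottleneck is 4 + 8 = 12.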

A  very  similar  special  case  of  this  problem  has  been  studied  in  the  context  of  partitioning  grids 
arising  from  the  discretization  of  domains  for  the  solution  of  partial  differential  equations,  by  Reed 
et  al.  [14].  It  is  instructive  to  consider  the  subtle  difference  in  the  problem  specification,  because 
the  conclusions  reached  differ  greatly. 

The partitions considered by Reed et al. all tessellate a two-dimensional domain ($N_{k,2}$ without
wraparound edges) with a common shape, e.g., rectangles, squares, or hexagons. The computation
to communication ratios of different shapes are analyzed, but the communication cost is taken as
the sum (over all grid points in the subgraph) of the cost of communicating each boundary point.
This may vary from point to point. For instance, Figure 1 illustrates some hexes; point A has two
edges cut, but since the endpoints of both edges are in the same hex, Reed et al. count the cost as
one, not two. Point B has two edges cut, but both of these are counted. With this measure, the
communication cost of a hex is taken as 10 although 14 edges are cut. Shapes like hexagons are
shown to achieve a better computation/communication ratio than do squares. This is interesting,
because in this case our results give general conditions under which squares are optimal, a significant
difference due entirely to a minor change in the model of communication costs.

Reed et al.'s measure makes sense in its presented context where a specific numerical algorithm
calls for the exchange of boundary value grid points. In other contexts unique edges from a node
represent unique pieces of information, and the cost function we adopt is appropriate. We are aware
of algorithms in computational fluid dynamics, for instance, where there is a unique "flow" along
every edge in a mesh. Most of the grid partitioning community counts cut edges.

While our results identify general conditions under which equi-partitions are optimal for the
bottleneck measure, it is worthwhile noting that this need not always be the case. An example
that partitions a 6 x 6 mesh into 3 partition elements is shown in Figure 2. Here the unbalanced
partition has bottleneck cost 28w + 10, the balanced partition has bottleneck cost 12w + 12. The
unbalanced partition is better whenever w < 6/19. This example illustrates the tension between
partitioning to minimize computational imbalance and communication overhead. Our goal is to find
general conditions under which obvious equi-partitions are optimal with respect to the bottleneck
metric.
Figure 2: Equal sized partitions need not be optimal. Panels: unbalanced partition, labeled
cost 28w+12; balanced partition, labeled cost 12w+14.


3  Preliminaries 

We first establish some preliminary results. These depend on $k$ in a way that is captured by defining
$T_k = 1$ for $k = 2$, and $T_k = 2$ for $k > 2$. Logarithms are base 2 throughout.

Observation 1 Let $A$ be any set of nodes in $N_{k,n}$. If $|A| = m$ and $Int(A) = v$, then
$Ext(A) = T_k m n - 2v$, since every node has $T_k n$ incident edges and each internal edge is counted
at both of its endpoints.

Lemma 2 Let $A$ be any set of nodes in $N_{k,n}$, $k \ne 3$, with $|A| = m$. Then $Int(A) \le (m \log m)/2$.
This bound is achieved when $m = 2^j$ for some $j \le n$.

Proof: We induct on $m$. The base case of $m = 1$ is trivially satisfied. Suppose then that the claim
is true for any set of size $m - 1$ or smaller, and choose any node set $A$ with $|A| = m$. Choose any
two nodes $x$ and $y$ in $A$, consider their indices expressed in base-$k$ notation and find a dimension
$j$ in which their indices differ in that notation. Let $a$ and $b$ be the dimension $j$ index for $x$ and $y$
respectively. Viewing these indices as lying on a "ring" $0 - 1 - 2 - \cdots - (k-1) - 0$, cut the ring
into two sequences of length 2 or greater, one of which contains $a$, and one of which contains $b$.
Partition $A$ into sets $X_a$ and $X_b$, with $X_a$ comprised of all nodes whose indices in dimension $j$ lie
in the same range as $a$'s, and $X_b = A - X_a$. Let $u$ and $m - u$ be the number of nodes in $X_a$ and
$X_b$ respectively. By the induction hypothesis, $X_a$ has no more than $(u \log u)/2$ internal edges, and
$X_b$ has no more than $((m - u)\log(m - u))/2$ internal edges. If $k = 2$ or if $k \ge 4$ there can be no
more than $\min\{u, m - u\}$ edges between $X_a$ and $X_b$, because any such edge has to connect nodes
whose indices differ only in dimension $j$, and which must be adjacent on the ring we partitioned.
Any node in either set can have at most one edge to the other set. It follows that $A$ can have no
more than

$$B_m(u) = (u \log u)/2 + ((m - u)\log(m - u))/2 + \min\{u, m - u\}$$

internal edges. Now the function

$$f_m(q) = (q \log q)/2 + ((m - q)\log(m - q))/2 + q$$

defined over $q \in [0, m/2]$ completely describes the bound as a function of $q = \min\{u, m - u\}$.
Considered as a continuous function of $q$, analysis of derivatives reveals $f_m(q)$ to be convex over
$[0, m/2]$, and it is hence maximized at the endpoint $q = m/2$. Simple algebra shows that
$B_m(u) \le f_m(m/2) = (m \log m)/2$, completing the induction. Observe that the same argument holds
in the case of $k = 2$ by relaxing the requirement that the dimension $j$ ring be cut into lengths of
2 or greater: there is only one cut possible, and a node in $X_a$ or $X_b$ can still have
at most one edge between $X_a$ and $X_b$. Finally, observe that when $m = 2^j$, $j \le n$, the bound is
achieved by any set $A$ that forms a $j$-dimensional hypercube in $N_{k,n}$. ■
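
Lemma 2 can be sanity-checked by exhaustive search on very small cubes. The sketch below is an illustration only, not part of the proof; it assumes integer node numbering and logarithms base 2, and the helper names are hypothetical. It also shows why k = 3 is excluded: the 3-node ring is a triangle with 3 internal edges, which exceeds (3 log 3)/2.

```python
from math import log2


def edge_list(k, n):
    """All edges of N_{k,n}, with nodes numbered 0 .. k**n - 1 by base-k digits."""
    edges = set()
    for u in range(k ** n):
        for j in range(n):
            weight = k ** j
            digit = (u // weight) % k
            v = u + ((digit + 1) % k - digit) * weight  # +1 neighbor in dimension j
            if v != u:
                edges.add((min(u, v), max(u, v)))
    return sorted(edges)


def max_internal_edges(k, n):
    """best[m] = largest Int(A) over every node set A with |A| = m (brute force)."""
    size = k ** n
    edge_masks = [(1 << u) | (1 << v) for u, v in edge_list(k, n)]
    best = [0] * (size + 1)
    for subset in range(1, 1 << size):
        m = bin(subset).count("1")
        internal = sum(1 for e in edge_masks if (subset & e) == e)
        best[m] = max(best[m], internal)
    return best


def lemma2_holds(k, n):
    """True if Int(A) <= (m log m)/2 for every node set of N_{k,n}."""
    best = max_internal_edges(k, n)
    return all(best[m] <= m * log2(m) / 2 + 1e-9 for m in range(1, k ** n + 1))


if __name__ == "__main__":
    print(lemma2_holds(4, 2), lemma2_holds(2, 3), lemma2_holds(3, 1))
```

The search is exponential in the number of nodes, so it is only practical for cubes with about 16 nodes or fewer.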

Another bound is also useful. We will say that set $A$ is nowhere completed if $A$ contains no
completed rows, i.e., there is no dimension $j$ for which there are $k$ nodes in $A$ whose base-$k$ indices
all agree except in dimension $j$.

Lemma 3 Let $A$ be any set of nodes in $N_{k,n}$, $k > 2$, with $|A| = m$ such that $A$ is nowhere
completed. Then $Int(A) \le n(m - m^{(n-1)/n})$. This bound is achieved whenever $k$ is divisible by $q$,
and $m = (k/q)^n$.

Proof: By Observation 1, maximizing $Int(A)$ is equivalent to minimizing $Ext(A)$; we seek a set $A'$
with $m$ nodes minimizing $Ext(A')$. $A'$ must be connected, otherwise we could always find a node set
with smaller external edge count by translating a connected component linearly through $N_{k,n}$ until
it eliminates one or more external edges by becoming adjacent to another connected component.
Now represent the set as a "Manhattan polyhedron" (every face is parallel to some axis) formed by
a collection of unit cubes in $\mathbb{R}^n$, each cube representing one node, and two cubes sharing a face if
there is an edge between the nodes they represent. Figure 3 illustrates this construct. The number
of external edges is thus equal to the number of exposed faces, that is, the surface area of the Manhattan
polyhedron. Now the surface area $S_m$ of any Manhattan polyhedron in $\mathbb{R}^n$ is at least as large as
that, say $S_r$, of the smallest "orthogonal polyhedron" (a rectangular solid in $\mathbb{R}^n$) that completely
encloses it. Let $v \ge m$ be the volume of this orthogonal polyhedron. The polyhedron with volume
$v$ forming a perfect cube in $\mathbb{R}^n$ has surface area $S_c \le S_r$. But the orthogonal polyhedron with
volume $m$ forming a perfect cube in $\mathbb{R}^n$ has smaller surface area yet. This minimal surface area
is $2nm^{(n-1)/n} \le Ext(A)$. The claimed bound on $Int(A)$ follows from Observation 1. Furthermore,
whenever $k$ is divisible by $q$, and $m = (k/q)^n$, we can construct a $(k/q) \times (k/q) \times \cdots \times (k/q)$ cube
with exactly $m$ nodes, in which case the bounds are exact. ■
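
The exactness claim for cubic blocks can be checked numerically. The following sketch counts the external and internal edges of a side x ... x side block in N_{k,n} and compares them with 2n m^{(n-1)/n} and n(m - m^{(n-1)/n}). It assumes k > 2 and side < k, so the block is nowhere completed, and it uses the fact that m^{(n-1)/n} equals side^{n-1} when m = side^n; the function names are hypothetical.

```python
from itertools import product


def ext_int_of_block(k, n, side):
    """(Ext, Int) of the block {0..side-1}^n inside N_{k,n}; assumes k > 2, side < k."""
    block = set(product(range(side), repeat=n))
    ext = twice_int = 0
    for node in block:
        for j in range(n):
            for step in (1, -1):
                nb = node[:j] + ((node[j] + step) % k,) + node[j + 1:]
                if nb in block:
                    twice_int += 1
                else:
                    ext += 1
    return ext, twice_int // 2


def lemma3_values(n, side):
    """(2 n m^((n-1)/n), n (m - m^((n-1)/n))) for m = side**n, in exact integers."""
    m = side ** n
    surface = side ** (n - 1)  # equals m^((n-1)/n)
    return 2 * n * surface, n * (m - surface)


if __name__ == "__main__":
    print(ext_int_of_block(6, 2, 3), lemma3_values(2, 3))
    print(ext_int_of_block(6, 3, 2), lemma3_values(3, 2))
```

Both printed pairs agree, (12, 12) for the 3 x 3 block in N_{6,2} and (24, 12) for the 2 x 2 x 2 block in N_{6,3}, which matches the statement that the bound is met by (k/q)-sided cubes.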

Our optimality results hold when the number of nodes in each partition set, $m = k^n/p$, is
small enough to ensure that the optimal partition sets are nowhere completed. Since some internal
edges are gained by forming a completed row (due to wrap-around), simple extensions to geometric
arguments like those of Lemma 3 are not sophisticated enough to analyze these tradeoffs. However,
a simple argument shows that for sets of size $m \le k$, the configuration minimizing external edges
need not have any completed rows.

Lemma 4 For all $k \ge 2^n$ and $n \ge 2$ there exists a nowhere completed subset of $k$ nodes in $N_{k,n}$
with minimal external edges.

Proof: When $k \ge 2^n$ (and $n \ge 2$), the single configuration of $k$ nodes that completes a row has
exactly $2k(n - 1)$ external edges, whereas the proof of Lemma 3 shows that the set of $k$ nodes which
is as cubelike as possible has no more than $2nk^{(n-1)/n}$ external edges. Now $2nk^{(n-1)/n} \le 2k(n - 1)$
if and only if $1/k \le (1 - 1/n)^n$. But $1/k \le 0.25$ for all $k \ge 4$, and $(1 - 1/n)^n$ increases
monotonically in $n$ (converging to $e^{-1}$), and $(1 - 1/2)^2 = 0.25$. ■

Figure 3: Geometric interpretation of a connected node set. Panels: node set in 3d cube (external
edges are highlighted); Manhattan polyhedron (exposed faces represent external edges).
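
The inequality used in this proof is easy to tabulate. The sketch below is a numerical illustration only; the function names are hypothetical. It compares the completed-row count 2k(n - 1) with the cube-like value 2n k^{(n-1)/n}, and evaluates the equivalent condition 1/k <= (1 - 1/n)^n.

```python
def cube_beats_row(k, n):
    """True if 2 n k^((n-1)/n) <= 2 k (n - 1)."""
    return 2 * n * k ** ((n - 1) / n) <= 2 * k * (n - 1)


def lemma4_condition(k, n):
    """True if 1/k <= (1 - 1/n)^n, the equivalent form used in the proof."""
    return 1 / k <= (1 - 1 / n) ** n


if __name__ == "__main__":
    for n in range(2, 6):
        ks = range(2 ** n, 2 ** n + 50)
        print(n, all(cube_beats_row(k, n) for k in ks))
```

The case n = 2, k = 4 is the tight one: both sides of the first inequality equal 8, and 1/4 equals (1/2)^2.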

Proofs that optimally configured sets of size $m > k$ may be nowhere completed are beyond the
scope of this note. However, we can put a lower bound on $Ext(A)$ for $|A| > k$, and analyze the
relative error of this bound.

Lemma 5 Let $k \ge 4$. For all $m \ge 2^n$ and $n \ge 2$, let $E_{m,n}$ be the minimal value of $Ext(A)$ among
all node sets $A$ with $|A| = m$. Then

$$2nm^{(n-1)/n} - 2m/k \le E_{m,n} \le 2nm^{(n-1)/n}.$$

Proof: The upper bound follows from the observation that among all sets $A$ that are nowhere
completed, $2nm^{(n-1)/n}$ is an upper bound on the minimal $Ext(A)$, and thus on $E_{m,n}$. The lower bound
follows by subtracting from this the maximum number of external edges that may be deleted by
completing a row, two per possible row. ■

Now the relative difference between the upper and lower bound is $m^{1/n}/(nk)$, which increases
in $m$. Values of $m$ we are most interested in derive from equi-partitions where every dimension is
sliced identically. Let $q$ divide $k$ evenly, and let $m = (k/q)^n$. In this case the relative difference is
$1/(qn)$. Consequently the bounds become tighter with increasing dimension size, $n$, and with
decreasing partition set size $(k/q)^n$.

Let $A$ be any set of nodes with $|A| = m$. From the observations above we see that

$$C_1(m) = wm + T_k m n - m \log m \le C(A) \quad \text{for all } m = 1, 2, \ldots, k^n,$$

and

$$C_2(m) = wm + T_k m n - 2n\left(m - m^{(n-1)/n}\right) \le C(A) \quad \text{for all } m = 1, 2, \ldots, k.$$

Observe that $C_2(m)$ is monotone non-decreasing, as $\frac{d}{dm} C_2(m) \ge 0$. Another result describes the
relationship between $C_1$ and $C_2$.


Lemma 6 For all $m \in [1, 2^n]$, $C_1(m) \ge C_2(m)$. For all $m \ge 2^n$, $C_1(m) \le C_2(m)$.

Proof: Analysis of derivatives with respect to $m$ shows that $C_1'(1) > C_2'(1)$; since $C_1(1) = C_2(1)$
we infer that initially, for $x > 1$, $C_1(x) > C_2(x)$. Since both functions are continuous this
dominance is maintained until the first $m$ such that $C_1(m) = C_2(m)$. Algebra shows that the unique
solution $m > 1$ is $m = 2^n$. At this point $C_1'(2^n) < C_2'(2^n)$, and the dominance reverses. ■
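
The crossover at m = 2^n can be observed numerically. The sketch below assumes base-2 logarithms and takes C_2(m) = wm + T_k m n - 2n(m - m^{(n-1)/n}), the form implied by Observation 1 together with Lemma 3; the function names are hypothetical.

```python
from math import log2


def T(k):
    """T_k = 1 for k = 2 and 2 for k > 2."""
    return 1 if k == 2 else 2


def C1(m, w, k, n):
    """Lower bound on C(A) obtained from Lemma 2."""
    return w * m + T(k) * m * n - m * log2(m)


def C2(m, w, k, n):
    """Lower bound on C(A) obtained from Lemma 3 (assumed form, see lead-in)."""
    return w * m + T(k) * m * n - 2 * n * (m - m ** ((n - 1) / n))


def crossover_pattern(w, k, n):
    """Check C1 > C2 on 2 .. 2^n - 1, equality at 2^n, and C1 < C2 beyond it."""
    below = all(C1(m, w, k, n) > C2(m, w, k, n) for m in range(2, 2 ** n))
    at = abs(C1(2 ** n, w, k, n) - C2(2 ** n, w, k, n)) < 1e-9
    above = all(C1(m, w, k, n) < C2(m, w, k, n) for m in range(2 ** n + 1, k ** n + 1))
    return below, at, above


if __name__ == "__main__":
    print(crossover_pattern(1.0, 8, 3))
```

The difference C_1(m) - C_2(m) does not depend on w or T_k, so the same pattern appears for any node weight.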


4  Analysis  of  Cost  Function 

Since both $C_1(m)$ and $C_2(m)$ are lower bounds on the cost $C(A)$ of any node set $A$ with $|A| = m$,
the function $C_3(m) = \max\{C_1(m), C_2(m)\}$ is a better composite bounding function. Previous
observations have established that

$$C_3(m) = \begin{cases} C_1(m) & \text{for } m \le 2^n \\ C_2(m) & \text{for } m \ge 2^n. \end{cases}$$

Furthermore, it is not difficult to show that $C_3(m)$ is concave over $m \in [1, 2^n]$, and that $C_3(m)$ is
increasing over $m \in [2^n, k^n]$. We also know that when $k \ge 4$, $C_3(m)$ is a lower bound
on the cost of a node set $A$ with $|A| = m \le k$ elements.

Our strategy now is to identify values of $m \le k$ for which it is possible to partition $N_{k,n}$ into
$k^n/m$ isomorphic subgraphs, each of whose cost equals $C_3(m)$. Since $C_3(m)$ is known to be increasing for
$m \ge 2^n$, we determine conditions under which $C_3(m)$ is increasing over $[1, 2^n]$. Considered as a
continuous function, the first derivative of $C_3(m)$ for $m \in [1, 2^n]$ is

$$\frac{d}{dm} C_1(m) = w + T_k n - \log m - 1/\ln 2.$$

This function decreases in $m$, and so will be non-negative over $[1, 2^n]$ if it is non-negative at $m = 2^n$.
The latter condition is satisfied whenever $w + n(T_k - 1) \ge 1/\ln 2$. Thus

Lemma 7 If $w \ge 1/\ln 2$, or if $k > 2$ and $n > 1$, then $C_3(m)$ is everywhere monotone non-decreasing
over $[1, k^n]$.
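
A small numerical check illustrates Lemma 7 on integer set sizes. This is a sketch under assumptions: logarithms are base 2, the piecewise form of C_3 switches at m = 2^n, C_2 is taken as wm + T_k m n - 2n(m - m^{(n-1)/n}), and the function names are hypothetical.

```python
from math import log, log2


def C3(m, w, k, n):
    """Composite bound: the Lemma 2 branch up to 2^n, the Lemma 3 branch beyond it."""
    t = 1 if k == 2 else 2
    if m <= 2 ** n:
        return w * m + t * m * n - m * log2(m)
    return w * m + t * m * n - 2 * n * (m - m ** ((n - 1) / n))


def lemma7_condition(w, k, n):
    """Sufficient condition stated in Lemma 7."""
    return w >= 1 / log(2) or (k > 2 and n > 1)


def is_nondecreasing(w, k, n):
    """Check C3(1) <= C3(2) <= ... <= C3(k^n) on the integers."""
    values = [C3(m, w, k, n) for m in range(1, k ** n + 1)]
    return all(b >= a - 1e-9 for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    print(lemma7_condition(0, 4, 2), is_nondecreasing(0, 4, 2))
    print(lemma7_condition(0, 2, 3), is_nondecreasing(0, 2, 3))
```

With w = 0 the bound is non-decreasing on N_{4,2}, while on the hypercube N_{2,3} it drops from 4.245 at m = 3 to 4 at m = 4, a case in which neither condition of the lemma holds.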

Monotonicity of $C_3(m)$ can be exploited, for if node sets $P_0, \ldots, P_{p-1}$ have sizes $m_0, \ldots, m_{p-1}$,
then $\max\{C_3(m_0), \ldots, C_3(m_{p-1})\}$ is minimized when the node sets have equal sizes. To complete
the analysis we simply identify conditions on $p$ that ensure that $C_3(m) = C(P_i)$ for all
$i = 0, \ldots, p-1$, and that $N_{k,n}$ can be partitioned into isomorphic node sets with this cost. Such
partitions must be optimal.


Theorem 8 The following are optimal partitions of $N_{k,n}$ with respect to the bottleneck cost.

• If some condition of Lemma 7 is satisfied, $k$ is even, and $p = k^n/2^j$ with $j \le n$, then $N_{k,n}$
may be partitioned into isomorphic hypercubes of dimension $j$.

• If some condition of Lemma 7 is satisfied, there is an integer $q$ such that $(k/q)^{1/n}$ is an integer and
$p = (k/q)^{(n-1)/n}$, then $N_{k,n}$ may be partitioned into isomorphic blocks of shape
$(k/q)^{1/n} \times (k/q)^{1/n} \times \cdots \times (k/q)^{1/n}$.
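
The first case of the theorem can be illustrated by building the partition explicitly. In this sketch, which is an assumption-laden illustration rather than the authors' construction, digits 2t and 2t+1 are paired in the first j dimensions, so each block is a j-dimensional hypercube with 2^j nodes; the bottleneck is then compared with C_1(2^j) = w 2^j + T_k 2^j n - j 2^j. All function names are hypothetical.

```python
from itertools import product


def hypercube_partition(k, n, j):
    """Split N_{k,n} (k even) into k^n / 2^j blocks, each a j-dimensional hypercube."""
    blocks = {}
    for node in product(range(k), repeat=n):
        # Pair digits (2t, 2t+1) in the first j dimensions; keep the rest fixed.
        key = tuple(x // 2 for x in node[:j]) + node[j:]
        blocks.setdefault(key, set()).add(node)
    return list(blocks.values())


def cost(block, k, n, w):
    """C(P_i) = w |P_i| + Ext(P_i) for a block of coordinate tuples."""
    ext = 0
    for node in block:
        nbrs = {node[:d] + ((node[d] + s) % k,) + node[d + 1:]
                for d in range(n) for s in (1, -1)}
        ext += sum(1 for v in nbrs if v not in block)
    return w * len(block) + ext


def bottleneck(k, n, j, w):
    return max(cost(b, k, n, w) for b in hypercube_partition(k, n, j))


def c1_bound(k, n, j, w):
    """C_1(2^j) with base-2 logarithms, so log(2^j) = j."""
    t = 1 if k == 2 else 2
    return w * 2 ** j + t * 2 ** j * n - j * 2 ** j


if __name__ == "__main__":
    for k, n, j in ((4, 2, 2), (4, 2, 1), (2, 3, 2), (6, 2, 2)):
        print((k, n, j), bottleneck(k, n, j, 1), c1_bound(k, n, j, 1))
```

In each listed case the bottleneck of the constructed partition equals the lower bound, which is the sense in which these partitions meet C_3.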
The partitions identified by this theorem are quite intuitive. They divide $N_{k,n}$ uniformly into
equally sized sets of nodes, and the nodes in a set are clustered tightly. If the number of nodes in the
set is less than $2^n$, the nodes form a hypercube of some dimension no greater than $n$. If the number
of nodes exceeds $2^n$ (but is no greater than $k$), they form a perfect cube in an $n$-dimensional space.
However, while these optimal partitions are intuitive, we have already seen that perfectly balanced
partitions need not be optimal. It is also noteworthy that the requirement on $w$ for optimality
disappears when $p$ is small enough, or when $k > 2$.

A final result addresses the fact that restricting the number of nodes per processor to $k$ or fewer
may be overly conservative. For $k < m \le (k/2)^n$ we can bound the deviation from optimal of cubic
equi-partitions.

Lemma 9 Let $q$ divide $k$ evenly, and consider the partitioning into adjacent blocks of size
$(k/q) \times \cdots \times (k/q)$. Then the bottleneck cost is no more than $100/(nq)\%$ larger than optimal.

Proof: Using $m = (k/q)^n$, Lemma 5 shows that the increase in external communication cost of the
cubic partition is no more than $100/(nq)\%$. ■


5  Conclusions 

k-ary n-cubes are regular graph structures that are found in numerous contexts, especially in
descriptions of communication networks. Partitioning of such graphs is a problem that arises in
network design, and in parallelized simulation of such networks. This paper examines the problem
of identifying optimal partitions of $N_{k,n}$ with respect to the bottleneck metric. Our investigations
identify two points of interest. First, existing work on partitioning regular graphs for parallel
processing has used a subtly different measure of communication, which leads to very different
results than ours. Secondly, while the partitions we identify as optimal are intuitive, we show
by example that equi-partitions need not always be optimal. Our results then help to delineate
problems with intuitive optimal partitions from those with non-intuitive optimal partitions.

Remaining open problems that we are pursuing include dealing more conclusively with the effect
of completing rows, and determining the minimal value of w ensuring that equi-partitions are
optimal.